Introduction

This document discusses analyses of a mental health data set.

Section 1 conducts exploratory data analysis and data wrangling on the dataset.
We obtain a dataset with imputed values and omitted rows and columns. The author is Alexander Ng.

Section 2 contains an analysis with hierarchical clustering and k-means analysis. The author is Alexander Ng (primary).

Section 3 contains an analysis using a multiple factor analysis (MFA) which is a generalization of principal components analysis (PCA). The author is Alexander Ng.

Section 4 contains an analysis using a support vector machine (SVM) approach using suicide as the response variable. The authors are Philip Tanofsky and Scott Reed.

Section 5 contains a supplementary analysis using binary logistic regression as a baseline model to evaluate the SVM results and contains additional conclusions. The author is Randall Thompson.

Section 6 presents our R code and technical appendices and references. The document was merged and edited by \(\color{red}{\text{TBD}}\).

1 Exploratory Data Analysis

We load the data using the readxl package, which is part of the tidyverse. This package lets us repair the column headers of the Excel spreadsheet on import. Setting .name_repair to "universal" converts all column names containing spaces or special characters into syntactic names. For example, Hx of Violence becomes Hx.of.Violence and MD Q1k becomes MD.Q1k.
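As a minimal sketch of the loading step, assuming a hypothetical file name `mental_health.xlsx` (the actual path is not given in this document), the call might look like the following. Base R's `make.names()` is used here only as a dependency-free stand-in to illustrate the renaming on the two examples above; it is not the routine readxl uses internally and may give different results on other names.

```r
# Hypothetical path: the real file name is not stated in this document.
path <- "mental_health.xlsx"

# read_excel() accepts .name_repair = "universal", which makes column
# names unique and syntactic. The call is guarded so the sketch still
# runs when readxl or the file is unavailable.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  raw <- readxl::read_excel(path, .name_repair = "universal")
  print(dim(raw))
}

# Stand-in illustration of the renaming, using base R only.
original <- c("Hx of Violence", "MD Q1k")
repaired <- make.names(original)
print(repaired)
```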

The original dataframe contains 175 observations (i.e. survey participants) as rows and 54 columns as variables.

## [1] 175  54

The columns contain both qualitative and quantitative variables.
Moreover, some columns represent categorical data but are encoded as numerical values.

Thus, we will perform 4 data transformations to obtain a cleansed and fully populated dataframe:

  1. Removal of observations with significant data issues.
  2. Removal of columns where data is meaningless or has excessive missing values.
  3. Imputation of missing values in those columns where business analysis allows sensible choices.
  4. Rescaling or conversion of data to factor or character values.

In the final section, we compare several demographic characteristics of the data with those of the general US population.

1.1 Missing Data Treatment

1.1.1 Removal of Observations

Four observations had missing values for Alcohol. Those observations also have missing values in a large number of other, correlated columns, as shown below. Excluding these 4 observations eliminates missing data for 6 columns and undefined values for several other variables. Thus, we retain 171 (or 97.7%) of the original 175 observations.
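The row filter can be sketched as follows. The toy data frame and its values are made up for illustration; only the column names Alcohol and THC and the counts 171 and 175 come from this document.

```r
# Toy data frame standing in for the survey data (values are made up).
toy <- data.frame(
  Alcohol = c(1, NA, 0, 3, NA),
  THC     = c(0, NA, 1, 2, NA),
  Age     = c(24, 31, 45, 19, 52)
)

# Keep only observations where Alcohol is recorded.
kept <- toy[!is.na(toy$Alcohol), ]

# Missing values remaining per column after the row filter.
remaining_na <- colSums(is.na(kept))
print(remaining_na)

# Retention rate reported in the text: 171 of 175 observations.
retained_pct <- round(100 * 171 / 175, 1)
print(retained_pct)
```

In the toy data, the rows with missing Alcohol are also the rows with missing THC, so a single filter clears both columns, which mirrors the correlated missingness described above.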

1.2 Removing Columns

We remove the columns Initial and Psych.meds. because they either carry no useful information or have over 50% missing values.

## [1] 171  54
## [1] 171  52
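A minimal sketch of the column removal, using a made-up toy data frame; only the two dropped column names are taken from the text.

```r
# Toy data frame (made-up values) with the two columns to be dropped.
toy <- data.frame(
  Initial     = c("A", "B"),
  Age         = c(24, 31),
  Psych.meds. = c(NA, 1),
  Sex         = c(1, 2)
)

drop_cols <- c("Initial", "Psych.meds.")

# Keep every column whose name is not in drop_cols.
trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(dim(toy))
print(dim(trimmed))
```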

1.3 Imputing Values

We explain our decisions on imputing the remaining missing values here.

Data summary
Name a
Number of rows 171
Number of columns 52
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Court.order 1 0.99 0.09 0.28 0 0 0 0 1 ▇▁▁▁▁
Education 7 0.96 11.90 2.19 6 11 12 13 19 ▁▅▇▂▁
Hx.of.Violence 7 0.96 0.24 0.43 0 0 0 0 1 ▇▁▁▁▂
Disorderly.Conduct 7 0.96 0.73 0.45 0 0 1 1 1 ▃▁▁▁▇
Suicide 9 0.95 0.30 0.46 0 0 0 1 1 ▇▁▁▁▃
Abuse 10 0.94 1.33 2.12 0 0 0 2 7 ▇▂▁▁▁
Non.subst.Dx 18 0.89 0.44 0.68 0 0 0 1 2 ▇▁▃▁▁
Subst.Dx 19 0.89 1.14 0.93 0 0 1 2 3 ▆▇▁▅▂

1.3.1 Abuse

There are only 10 remaining observations with missing Abuse data. As we see below, the most frequent response is 0 – i.e. No.

Moreover, the conditional means of the survey scores ADHD.Total and MD.TOTAL for this subpopulation are close to the population means, as the output below shows. So we impute Abuse = 0 for these observations.

## [1] "Conditional Mean ADHD.Total: 34.9 Population Mean ADHD.Total: 34.5087719298246"
## [1] "Conditional Mean MD.Total: 10.6 Population Mean MD.TOTAL: 10.0760233918129"
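The check behind this decision can be sketched as follows: compare the mean score of the rows with a missing value against the mean over all rows, then fill in the most frequent observed response. The toy values are made up; only the column names come from the dataset.

```r
# Toy data (made-up values) standing in for the survey data frame.
toy <- data.frame(
  Abuse      = c(0, 0, NA, 2, 0, NA, 5, 0),
  ADHD.Total = c(30, 41, 36, 28, 35, 33, 40, 37)
)

missing <- is.na(toy$Abuse)

# Mean score among rows with missing Abuse vs. the mean over all rows.
cond_mean <- mean(toy$ADHD.Total[missing])
pop_mean  <- mean(toy$ADHD.Total)

# Most frequent observed response, used as the imputed value.
counts     <- table(toy$Abuse)
mode_value <- as.numeric(names(counts)[which.max(counts)])

imputed <- toy
imputed$Abuse[missing] <- mode_value
print(c(conditional = cond_mean, population = pop_mean))
```

Comparing means is a coarse check. It says nothing about the shape of the conditional distribution, so it is best read as a sanity test rather than a proof that the values are missing at random.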

1.3.2 Suicides

We choose to impute Suicide = 0 for the 9 observations where Suicide is not defined. The number of incidents is low, the most common response is 0 (No), and the conditional means of ADHD.Total and MD.TOTAL for this subpopulation are close to the population means.

## [1] "Conditional Mean ADHD.Total: 36.5555555555556 Population Mean ADHD.Total: 34.5087719298246"
## [1] "Conditional Mean MD.Total: 10.3333333333333 Population Mean MD.TOTAL: 10.0760233918129"

1.3.3 History of Violence and Disorderly Conduct

The non-responses for History of Violence and Disorderly Conduct are perfectly correlated in our dataset. Also, because the questions are so conceptually linked, we consider the frequency table of their joint distribution. As we see below, the most frequent scenario is Disorderly Conduct = 1 while History of Violence = 0, which occurs 79 times, nearly twice as often as any other scenario.

Moreover, the conditional mean of ADHD.Total and MD.TOTAL is unaffected for this subpopulation.

Disorderly Conduct
0 1 NA
0 45 79 0
1 0 40 0
NA 0 0 7
Note:
Each row shows counts for History of Violence 0=No, 1=Yes
## [1] "Conditional Mean ADHD.Total: 35.4285714285714 Population Mean ADHD.Total: 34.5087719298246"
## [1] "Conditional Mean MD.Total: 10.4285714285714 Population Mean MD.TOTAL: 10.0760233918129"

1.3.4 Education

We find that the modal number of years of Education is 12 and impute that value where none is provided. This corresponds to completing high school but not college.

## 
##    6    7    8    9   10   11   12   13   14   15   16   17   18   19 <NA> 
##    2    2    5   12   12   23   65   15   14    1    7    2    3    1    7
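The modal imputation can be reproduced from the frequency table above. The sketch below rebuilds the Education vector from those counts (the row order is arbitrary and does not match the real data) and fills the 7 missing values with the mode.

```r
# Counts copied from the frequency table above; 7 values are missing.
years  <- 6:19
counts <- c(2, 2, 5, 12, 12, 23, 65, 15, 14, 1, 7, 2, 3, 1)
education <- c(rep(years, times = counts), rep(NA, 7))

# The mode is the most frequent observed value.
tab <- table(education)
modal_years <- as.integer(names(tab)[which.max(tab)])

# Replace missing values with the mode.
education_imputed <- education
education_imputed[is.na(education_imputed)] <- modal_years
print(table(education_imputed))
```

After imputation the count at 12 years rises from 65 to 72, which is the figure that appears in the later frequency table of Section 1.5.1.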

1.3.5 Substance Use

##       
##         0  1  2 <NA>
##   0     8 22 12    0
##   1    53  6  2    0
##   2    29  4  2    0
##   3    12  2  0    0
##   <NA>  0  1  0   18

Based on the cross-tabulation above, we infer some level of substance-related drug use when non-substance-related use is low. We also assume non-substance-related drug use is 0 when no answer is provided.

1.4 Transforming Variables

In this section, we apply the above imputation rules to the variables with missing values. Next, we construct factor equivalents, named by adding an f prefix to an abbreviated version of the original variable name. Lastly, we replace the original columns with their factor equivalents.

The table below shows the translation from old to new variables along with the range of permitted values.

Conversion of Old Variables to New
OldVar NewVar Range
Sex fSex M,F
Race fRace WH,AF,HI,AS,NA,OT
Court.order fCO Yes,No
Abuse fAbuse 0-7 (factor)
Hx.of.Violence fHViol Yes,No
Disorderly.Conduct fDCond Yes,No
Suicide fSuic Yes,No
Non.subst.Dx fNonDx 0,1,2 (factor)
Subst.Dx fSubsDx 0,1,2,3 (factor)
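A minimal sketch of the conversion for one binary and one multi-level variable, using made-up values. The 0 = No, 1 = Yes coding follows the note under the History of Violence table; the rest of the toy data is illustrative.

```r
# Toy data (made-up values) with numerically coded categorical columns.
toy <- data.frame(
  Suicide  = c(0, 1, 0, 0),
  Subst.Dx = c(0, 3, 1, 2)
)

# Binary 0/1 column -> labelled factor with levels No, Yes.
toy$fSuic <- factor(toy$Suicide, levels = c(0, 1), labels = c("No", "Yes"))

# Multi-level coded column -> factor that keeps the codes as levels.
toy$fSubsDx <- factor(toy$Subst.Dx, levels = 0:3)

# Keep only the factor versions of the columns.
toy <- toy[, c("fSuic", "fSubsDx")]
str(toy)
```

Declaring the levels explicitly keeps a level present even when no row in the data takes that value, which matters when categories are rare.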

We export the dataset for reference purposes.

1.5 The Survey Is Not Representative of the US Population

Because the dataset is anonymized, we have no background context on the study’s location, purpose and population. Comparisons with available data sources suggest that the survey population is highly unrepresentative of the general US population in gender, race, educational attainment and history of violence. We present evidence for each discrepancy in turn.

Consequently, we should not apply statistical inferences and machine learning predictions from the dataset and our models to the general US population. Any inference should be conditioned on the true, intended population.

1.5.1 Education

The educational attainment of the survey population is below the US average. US Census data shows that 89% of the US population 18 years and older have graduated high school or better. By comparison, only 67% of our sample have graduated high school or better. (We use the 18 years and older population to match the observed age range in the sample.) The table below shows the frequency of educational attainment by years of education within the survey population.

## 
##  6  7  8  9 10 11 12 13 14 15 16 17 18 19 
##  2  2  5 12 12 23 72 15 14  1  7  2  3  1

Summing the counts of those with 12 or more years of education and dividing by the 171 observations, we obtain the proportion who graduated high school or better:

## [1] 0.6725146
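The proportion above can be reproduced from the frequency table, taking 12 or more years of education as "high school or better" (the cutoff is an assumption that matches the reported figure).

```r
# Counts copied from the frequency table above (after imputation).
years  <- 6:19
counts <- c(2, 2, 5, 12, 12, 23, 72, 15, 14, 1, 7, 2, 3, 1)

# Share of the sample with 12 or more years of education.
hs_or_better <- sum(counts[years >= 12]) / sum(counts)
print(hs_or_better)
```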

Below we show the US Census trend in educational attainment for adults 25 and older. The 18 years and older population is contained in the underlying Excel spreadsheet, however.

America's Education: Population Age 25 and Over by Educational Attainment [Source: U.S. Census Bureau]

1.5.2 Gender

The survey population is skewed towards males at 55.6%, compared with 49.1% in the general population.

Percentage by Gender
F M
44.4 55.6

1.5.3 Race

The survey population is not representative of the racial composition of the US.

Race AF HI OT WH
Count 97 1 2 71
Percent 56.7 0.6 1.2 41.5

The US Census Bureau reports that Whites are 76.3%, Blacks are 13.4%, Hispanics are 18.5%, Asians are 5.9% and American Indians are 1.3% of the population, according to 2019 data.

We conclude that Blacks are overrepresented in the survey, 56.7% (survey) vs. 13.4% (census), and Whites are underrepresented, 41.5% (survey) vs. 76.3% (census). Hispanics are underrepresented, 0.6% (survey) vs. 18.5% (census), and Asians are not represented at all, 0% (survey) vs. 5.9% (census).
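The survey percentages follow directly from the counts in the race table. The sketch below recomputes them and forms a simple survey-to-census ratio; the ratio is an illustrative summary added here, not a statistic from the original analysis.

```r
# Counts from the race table above.
race_freq <- c(AF = 97, HI = 1, OT = 2, WH = 71)

# Percentage of the survey population in each category.
survey_pct <- round(100 * race_freq / sum(race_freq), 1)
print(survey_pct)

# Census percentages quoted in the text (2019 data).
census_pct <- c(AF = 13.4, HI = 18.5, WH = 76.3)

# Ratio above 1 means overrepresented in the survey, below 1 means
# underrepresented.
ratio <- survey_pct[names(census_pct)] / census_pct
print(round(ratio, 2))
```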

2 Multiple Factor Analysis of the Mental Health Data

Multiple Factor Analysis (MFA) is a generalization of principal components analysis (PCA) that allows the combined use of quantitative and qualitative variables in a single model. Because PCA does not support qualitative variables, I have chosen to implement MFA for this assignment. MFA can be viewed as a weighted combination of PCA for quantitative variables and Multiple Correspondence Analysis (MCA) for qualitative variables. While a number of the model outputs can be interpreted in the same way as in PCA, MFA also allows variables to be grouped. Thus, we can explore the relations between individuals, variables and groups of variables.

An exposition of the mathematical background of MFA is too lengthy to include here. I refer the reader to the online course on PCA, MCA and MFA taught by Francois Husson and to the textbook by Jerome Pages and Francois Husson, “Exploratory Multivariate Data Analysis by Example Using R” (http://factominer.free.fr/course/books.html). They also implement and support the software packages FactoMineR and FactoShiny used herein.

In the next section, I describe the model setup and the mapping of variables into groups. Afterwards, I report my findings on the overall performance of the MFA model and the contribution of groups of variables. Then, I go deeper into the variable correlations and loadings on the MFA dimensions. Finally, I go deeper into the individuals, looking at how the biplot separates them by gender, race and behavioral categories.

2.1 Model Setup

To run MFA, we need to group the variables by type: quantitative or qualitative and by role: active or supplementary. Active means the variable is used in estimating the MFA factors. Supplementary means the variable is not used to calculate the MFA dimensions but its values are fitted ex-post into the MFA dimensions for interpretative context of the active variables.

A variable group consists of a set of columns of the dataset of one type and one role. We are required to partition all variables into groups. MFA requires at least 2 groups to be run. For each group, a PCA analysis is run - either on the actual values for a quantitative group or on a 0-1 matrix of dummy variables for a qualitative group.

We consider an MFA analysis on the complete set of variables excluding the variables ADHD.Total and MD.TOTAL which are sums of survey scores. Since MFA can handle individual survey scores, we prefer to use those columns to have more granular insights. Excluding those totals avoids linear dependencies and zero eigenvalues in the MFA.

Our dataset amfa1 will be used in the MFA analysis. It has 50 variables and 171 observations.

## [1] 171  50

We partition the variables into 8 groups as shown in the table below. They are 4 groups of active quantitative variables and 2 groups of active qualitative variables and 2 groups of supplementary qualitative variables.

MFA Variable Groups
GroupName Role Type NumVars Variables
AE Active scaled 2 Age, Education
Abuse Supp nominal 1 fAbuse
Vio Active nominal 4 fCO, fHViol, fDCond, fSuic
Dx Active nominal 2 fNonDx, fSubsDx
ADHD Active scaled 18 ADHD.Q1 … ADHD.Q18
MD Active scaled 15 MD.Q1a, … MD.Q1m, MD.Q2, MD.Q3
Sub Active scaled 6 Alcohol, THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids
Demo Supp nominal 2 fSex, fRace

We choose to assign fSex, fRace and fAbuse as supplementary variables for two reasons. First, omitting potentially sensitive data like Race, Sex and Abuse may be beneficial in machine learning applications. Second, they may give insight into how well the model separates individuals using the other variables. If the model is accurate, the MFA dimensions should also separate individuals by the omitted variables.
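The group layout in the table maps onto the arguments of `FactoMineR::MFA()` roughly as sketched below. This is an illustration under stated assumptions, not necessarily the exact call used for this report: it assumes the columns of the data frame are ordered group by group in the same sequence as the table, and `run_mfa` and `groups` are hypothetical helper names.

```r
# Group layout taken from the table above. Columns of the data frame
# are assumed to appear group by group in this same order.
groups <- data.frame(
  name = c("AE", "Abuse", "Vio", "Dx", "ADHD", "MD", "Sub", "Demo"),
  size = c(2, 1, 4, 2, 18, 15, 6, 2),
  type = c("s", "n", "n", "n", "s", "s", "s", "n"),  # s = scaled, n = nominal
  supplementary = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# The group sizes must account for all 50 variables.
stopifnot(sum(groups$size) == 50)

# Hypothetical wrapper around FactoMineR::MFA(); needs FactoMineR installed.
run_mfa <- function(data) {
  stopifnot(ncol(data) == sum(groups$size))
  FactoMineR::MFA(
    data,
    group         = groups$size,
    type          = groups$type,
    name.group    = groups$name,
    num.group.sup = which(groups$supplementary),
    graph         = FALSE
  )
}

# Usage, assuming amfa1 has its 50 columns in group order:
# res <- run_mfa(amfa1)
```

Supplementary groups are passed by position through `num.group.sup`, so the Abuse and Demo groups (positions 2 and 8) are projected onto the dimensions without influencing them.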

2.2 Model Performance

2.2.1 Scree Plot

Due to the diversity of survey data, the dimension reduction does not dramatically explain most of the variance with 2 or 3 dimensions. That is, there is no sharp “elbow” in the scree plot of the MFA dimensions below. The first 2 dimensions capture 11.4% and 10.9% of the variance respectively. The first 10 dimensions are required to explain 65.2% of the cumulative variance.

Eigenvalues of MFA
eigenvalue percentage of variance cumulative percentage of variance
comp 1 1.9 11.4 11.4
comp 2 1.8 10.9 22.3
comp 3 1.4 8.2 30.5
comp 4 1.1 6.8 37.3
comp 5 0.9 5.6 42.9
comp 6 0.9 5.2 48.1
comp 7 0.8 4.7 52.8
comp 8 0.7 4.2 57.1
comp 9 0.7 4.1 61.2
comp 10 0.7 4.0 65.2
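The cumulative column is the running sum of the percentage column. The sketch below recomputes it from the rounded percentages in the table, so small differences from the reported values (up to 0.1 here) are due to rounding only.

```r
# Percentage of variance per dimension, copied from the table above.
pct <- c(11.4, 10.9, 8.2, 6.8, 5.6, 5.2, 4.7, 4.2, 4.1, 4.0)

# Running total of explained variance.
cum_pct <- cumsum(pct)

# Cumulative percentages as reported in the table.
reported <- c(11.4, 22.3, 30.5, 37.3, 42.9, 48.1, 52.8, 57.1, 61.2, 65.2)

# Largest gap between recomputed and reported values.
max_gap <- max(abs(cum_pct - reported))

# First dimension at which at least half of the variance is explained.
dims_for_half <- which(cum_pct >= 50)[1]
print(data.frame(dim = seq_along(pct), pct, cum_pct, reported))
```

With a fitted FactoMineR result, the same table is available in the `eig` element of the result object.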

2.2.2 Group Plot

Next we define and explain the group plot results. The group plot represents the \(J\) variable groups as points on a unit square. The X-axis represents MFA dimension 1 - represented by the vector \(v_1\) and Y-axis represents MFA dimension 2 - with vector \(v_2\). The variable groups \(K_1, \cdots , K_J\) would be the 8 groups previously set up.

The group plot below shows a statistic \(L_g(K, v_i)\) for MFA principal components \(v_1, v_2\). \(L_g\) is the projected inertia of the variables in group \(K\) on \(v_i\) divided by the largest eigenvalue of PCA on group \(K\). More formally,

\[ L_g(K_j, v_1) = \frac{1}{\lambda_{1}^{j}} \sum_{k \in K_j} cov^2(x_k, v_1)\]

In fact, MFA defines the first principal component \(v_1\) to maximize \(L_g\) over all variable groups.

\[\underset{v_1 \in R^I}{\operatorname{arg max}} \sum_{j=1}^{J} L_g(K_j,v_1)\]

We know that \(0 \leq L_g(K,v_1) \leq 1\) because the projected inertia is bounded by the largest eigenvalue \(\lambda_{1}^{j}\). When \(L_g(K_j,v_1)=0\), the variable group is uncorrelated to the first MFA dimension. When \(L_g(K_j,v_1)=1\), the variable group is perfectly correlated with the first MFA dimension.
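To make the bounds concrete, here is a small numerical sketch of \(L_g\) for a single group of standardized variables. The data are simulated, and the function name `lg` is a hypothetical helper; the candidate vector \(v\) is centred and scaled to unit variance before the covariances are taken. When \(v\) is the group's own first principal component the statistic equals 1, and for an unrelated noise vector it falls near 0.

```r
set.seed(42)
n <- 200
shared <- rnorm(n)

# One simulated variable group: two correlated columns and one noise column.
X <- scale(cbind(
  x1 = shared + rnorm(n, sd = 0.5),
  x2 = shared + rnorm(n, sd = 0.5),
  x3 = rnorm(n)
))

# Largest eigenvalue of the PCA on this group.
eig     <- eigen(cov(X), symmetric = TRUE)
lambda1 <- eig$values[1]

# L_g: sum of squared covariances with v, divided by lambda1.
lg <- function(X, v, lambda1) {
  v <- as.numeric(scale(v))  # centre and scale v to unit variance
  sum(cov(X, v)^2) / lambda1
}

v_pc    <- as.numeric(X %*% eig$vectors[, 1])  # the group's first PC
v_noise <- rnorm(n)                            # unrelated direction

lg_pc    <- lg(X, v_pc, lambda1)
lg_noise <- lg(X, v_noise, lambda1)
print(c(first_pc = lg_pc, noise = lg_noise))
```

In the full MFA the vector \(v_1\) is chosen to maximize the sum of this quantity over all groups, so no single group needs to reach 1.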

The group plot tells us:

  • AE (Age, Education) group is somewhat projected to dimension 1 but not projected to dimension 2.
  • Dx, Sub, and Vio groups are somewhat projected to dimension 1 and weakly projected to dimension 2.
  • Demo, Abuse are not projected to dimension 1 and weakly projected to dimension 2.
  • ADHD is not projected to dimension 1 but strongly projected to dimension 2.
  • MD is weakly projected to dimension 1 and strongly projected to dimension 2.

In brief, the survey responses to ADHD and Mood are strong descriptors but play a secondary role because Age, Education, Drugs, Substance and Violence features explain more of the variation.

2.2.3 Qualitative Variables Biplot

In the following biplot, we show the barycenters of all active qualitative (categorical) variable values.
The coordinates of a barycenter are the means of the individuals’ coordinates along MFA dimensions 1 and 2 respectively.
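A barycenter can be computed as a group mean of coordinates. The sketch below uses made-up coordinates and unweighted means; the column names `Dim.1` and `Dim.2` are illustrative, and an analysis with row weights would need weighted means instead.

```r
# Made-up individual coordinates on the first two dimensions.
coords <- data.frame(
  Dim.1 = c(1.0, 2.0, -1.0, -2.0, 0.0),
  Dim.2 = c(0.5, 1.5, -0.5, -1.0, -1.5),
  fSuic = c("Yes", "Yes", "No", "No", "No")
)

# Barycenter of each category: mean coordinate on each dimension.
bary <- aggregate(cbind(Dim.1, Dim.2) ~ fSuic, data = coords, FUN = mean)
print(bary)
```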

For example, we highlight the values of fSuic as \(\color{green}{\text{`fSuic_Yes`}}\) and \(\color{red}{\text{`fSuic_No`}}\). We observe that individuals who answered yes to the attempted Suicide question are centered in the upper right corner (i.e. Quadrant I) while those who answered no to the attempted Suicide question are centered in Quadrant III.

When the barycenters of a categorical variable’s values are far apart, our MFA is using the categorical value in defining the principal axes.

The Barycenters plot suggests:

  • Violence related variables have barycenters separated by the origin with No categories for history of violence, suicide, court order in Quadrant III while their Yes responses are centered on upper Quadrant I.
  • SubsDx is well separated. It takes 4 categorical values. Individuals with SubsDx=0 are located in left of Quadrant II while those with SubsDx=1,2,3 are clustered in lower right half. We cannot readily distinguish degrees of SubsDx among 1,2 or 3.
  • NonDx is well separated but its quadrant orientation is flipped relative to SubsDx. It is worth investigating why non-substance-related Dx is opposite to substance-related Dx. Examining the contingency table of non-substance-related drug use vs. substance-related drug use below, we see that NonDx = 0 is strongly associated with SubsDx of 1 or higher, while SubsDx = 0 occurs mostly with NonDx of 1 or 2.

Substance-related Dx Use
NonDrugUse 0 1 2 3
0 8 53 29 30
1 22 6 4 3
2 12 2 2 0

2.2.4 Correlation Circle

The following plot shows the correlations of each quantitative variable \(X_j\) to the MFA dimension 1 and 2 vectors \(v_1\) and \(v_2\). That is, \(x(X_j) = cor(X_j, v_1)\) and \(y(X_j)=cor(X_j, v_2)\).
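The coordinates on a correlation circle can be sketched with simulated data. Here an ordinary PCA from `prcomp()` stands in for the MFA dimensions, which is an assumption made only to keep the example dependency-free; the variable names are made up. Because the two component vectors are uncorrelated, every point lies inside the unit circle.

```r
set.seed(7)
n <- 150
base <- rnorm(n)

# Simulated quantitative variables: two related columns and one noise column.
X <- cbind(
  q1 = base + rnorm(n, sd = 0.6),
  q2 = base + rnorm(n, sd = 0.6),
  q3 = rnorm(n)
)

# Stand-in for the first two MFA dimensions: scores of a standardized PCA.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
v1  <- pca$x[, 1]
v2  <- pca$x[, 2]

# Circle coordinates: correlation of each variable with v1 and v2.
circle <- data.frame(
  variable = colnames(X),
  x = as.numeric(cor(X, v1)),
  y = as.numeric(cor(X, v2))
)

# Distance from the origin; values near 1 mean the variable is well
# represented on the plane spanned by the two dimensions.
circle$radius <- sqrt(circle$x^2 + circle$y^2)
print(circle)
```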

There are 4 active quantitative variable groups. Each group is assigned its own color and all member variables in a group are plotted on the circle. As a result, the plot is busy and the position of each individual survey question is hard to read.

However, we can draw several conclusions:

  • ADHD survey responses are all strongly correlated to each other and to MFA dimension 2.
  • MD survey responses are more dispersed but still well correlated to MFA dimension 2.
  • Substance use is quite dispersed but positively correlated to MFA dimension 1. Of these, THC, Cocaine are dominant. Stimulants and Sedative.hypnotics are not well projected to MFA \(v_1\) or \(v_2\).
  • Age and Education are negatively but moderately correlated to MFA dimension 1.
  • Substances like THC and Cocaine are positively correlated to each other and MFA dimension 1 but somewhat negatively to dimension 2.

2.3 Interpreting the Principal Components

Now we examine the granular variable interaction with the first two principal components of the MFA. Our goal is to interpret the two principal axes based on the meaning of the variables in their business domain.

2.3.1 Dimension 2 - Survey Information

Previously, we observed that ADHD and MD questions projected meaningfully onto the 2nd MFA dimension but not the first. Here, we will look at their correlations and covariances in detail.

The plot below displays the percentage contribution of each variable to the covariance with MFA dimension 2 and its correlation to the same. We rank the variables in the table and show the top 5 and bottom 5 variables. Thus, we will see which survey questions have the greatest explanatory power.

The table and chart show the most important ADHD survey questions are:

  • ADHD.Q8
    • 70% correlation
    • 2.91% of explained covariance
    • “How often do you have difficulty keeping your attention when you are doing boring or repetitive work?”
  • ADHD.Q10
    • 67% correlation
    • 2.7% of explained covariance
    • “How often do you misplace or have difficulty finding things at home or at work?”
  • ADHD.Q7
    • 66% correlation
    • 2.56% of explained covariance
    • “How often do you make careless mistakes when you have to work on a boring or difficult project?”

In general, all ADHD question variables are consistently projected and significant.

Some MD questions are also important and significantly projected but not consistently.

  • MD.Q1g
    • 69% correlation
    • 4.67% of explained covariance, the largest contribution
    • “… you were so easily distracted by things around you that you had trouble concentrating or staying on track?”
  • MD.Q2
    • 62% correlation
    • 3.76% of explained covariance
    • “… have several of these ever happened during the same period of time?”
  • MD.Q1b
    • 59% correlation
    • 3.41% of explained covariance
    • “… you were so irritable that you shouted at people or started fights or arguments?”

However, some MD survey questions had almost no correlation with MFA dimension 2 and may be ineffective predictors of related variables.

  • MD.Q1c
    • 7.7% correlation
    • 0.06% of explained covariance
    • “… you felt much more self-confident than usual?”

Dim 2 - Corr/Covar%
Corr Var-Cntrb
ADHD.Q8 0.70 2.91
MD.Q1g 0.69 4.67
ADHD.Q10 0.67 2.70
ADHD.Q7 0.66 2.56
ADHD.Q9 0.66 2.56
fDCond_Yes -0.16 0.25
fSubsDx_1 -0.22 1.09
Cocaine -0.26 2.60
fNonDx_0 -0.39 1.53
fSuic_No -0.46 2.29

2.3.2 Dimension 1 - Education, Substances and Conduct

Dim 1 - Corr/Covar%
Corr Var-Cntrb
fDCond_Yes 0.62 3.72
fNonDx_0 0.55 2.86
THC 0.48 8.54
MD.Q1L 0.41 1.60
MD.Q1a 0.41 1.58
fNonDx_1 -0.36 3.30
fNonDx_2 -0.36 3.80
Education -0.52 14.29
fDCond_No -0.62 10.42
fSubsDx_0 -0.66 10.41

2.3.3 Interpretation

The two preceding subsections give us the information needed to interpret the first two principal components of the MFA.

The first MFA principal component \(F_1\) appears to differentiate:

  • Substance related Drug Use \(F_1 > 0\) is Yes vs. \(F_1 < 0\) is No.
  • Education \(F_1 > 0\) has low education vs. \(F_1 < 0\) has high education.
  • THC \(F_1 > 0\) uses Marijuana vs. \(F_1 < 0\) does not use Marijuana.
  • Age \(F_1 > 0\) is younger vs. \(F_1 < 0\) is older.
  • Disorderly Conduct \(F_1 > 0\) has been disorderly vs. \(F_1 < 0\) has not been disorderly.
  • MD Q1a \(F_1 > 0\) gets hyper and into trouble vs. \(F_1 < 0\) does not get hyper and into trouble.

The second MFA principal component \(F_2\) appears to differentiate on the survey questions and mood related behavior.

  • Suicide attempts \(F_2 > 0\) is Yes vs. \(F_2 < 0\) is No.
  • MD Q1g can’t concentrate: \(F_2 > 0\) is Yes vs. \(F_2 < 0\) is No.
  • ADHD Q8 can’t pay attention when bored: \(F_2 >0\) is Yes vs. \(F_2 < 0\) is No.

2.4 Analysis of Individuals

Plotting the data cloud of individuals on the first 2 dimensions of an MFA analysis can provide important information about the predictive or discriminatory power of model which we may not see in the earlier plots. For each individual, we plot their coordinates on the MFA dimension 1 and 2 axes in a scatter plot.

If we color code each individual point by their value \(k\) of category \(K_j\) then we can test if the points are clustered in the biplot for any value \(k\). This will imply that the MFA dimensions 1 and 2 can separate those individuals by category \(K_j\) and potentially yield additional meaningful predictions.

2.4.1 Suicide, History of Violence and Disorderly Conduct

We examine the Vio group of categorical variables first. Do the MFA dimensions differentiate by these variables? The answer is \(\color{blue}{\text{Yes!}}\)

In the biplot below, “Individuals by Suicide Attempt”, the green dots represent individuals who attempted suicide. We see the green dots fall mostly above the \(X\)-axis and are slightly concentrated on the right half plane. Thus, Dimension 2 appears to differentiate by attempted suicide.

In the biplot below, “Individuals by Court Order”, the green dots represent individuals who received Court Orders. We see the green dots fall mostly on the right half plane. Thus, Dimension 1 appears to differentiate when Court Order = Yes but not when Court Order = No.

In the biplot below, “Individuals by History of Violence”, the green dots represent individuals with a history of violence. We see the green dots fall mostly on the right half plane but some points fall into part of the left half plane. Thus, Dimension 1 appears to partly differentiate when fHViol = Yes.

The majority of survey participants engaged in disorderly conduct. Dimension 1 is effective in differentiating those who did not engage in disorderly conduct. Those who did not engage in disorderly conduct fall in the left side of the plot.

2.4.2 Supplemental Variables: Race, Gender and Abuse

Moving to the supplemental qualitative variables, we will conclude that MFA has little or no ability to differentiate by these categories. We see below that Whites and Blacks are roughly evenly distributed across the biplot. There is no clear visual evidence that MFA can differentiate by Race along dimensions 1 and 2. However, remember that supplementary variables are not used in calibrating the MFA model.

Women seem to be more heavily concentrated in the upper half plane. Thus, MFA dimension 2 may somewhat differentiate by Sex.

Abuse is relatively challenging to visualize because it has 8 levels. We observe no obvious patterns that help differentiate Abuse levels. This is consistent with the low projection of Abuse in the earlier group plot. Neither the individuals biplot nor the simpler barycenter biplot reveals a clear structure. In the latter, the barycenters of the levels of Abuse show no pattern or trend.

2.4.3 Drug Use

There is strong evidence that the model differentiates Non-substance related drug use, though not along the principal axes but along a tilted version of Dimension 2. In the plot below, those who responded No to Non-substance related drug use fall mostly on the right half plane. Thus, Dimension 1 has significant power to differentiate that group. However, individuals with responses of 1 and 2 are intermixed in the left half plane.

The plot below shows some clustering of No responses to Substance related Drug Use in the left half plane. Thus Dimension 1 has some ability to differentiate but the cluster boundary looks tilted.

2.5 MFA Wrap-up

The use of MFA in analyzing this mental health dataset has been successful in our opinion. The ability to handle both quantitative and qualitative data in a coherent framework means we don’t have to shoehorn qualitative data into a quantitative form. The findings show that mental health responses on ADHD and Mood are secondary to the demographics, education, drug or substance usage and impulsiveness of the individual. Lastly, the ability to differentiate by attempted suicide bodes well for our prediction model using a Support Vector Machine.

3 Appendices

3.1 References

Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” WIREs Comp Stat 2: 433–59. John Wiley and Sons, Inc. (http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf)

MFA - Multiple Factor Analysis (http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/116-mfa-multiple-factor-analysis-in-r-essentials)

Multiple Factor Analysis Playlist for using FactoMineR (https://www.youtube.com/playlist?list=PLnZgp6epRBbRX8TEp1HlFGqfMf_AxYEj7)

3.2 Code

We summarize all the R code used in this project in this appendix for ease of reading.

#{r ref.label=knitr::all_labels(), echo=T, eval=F} #